{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "10.6. 求近义词和类比词.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "authorship_tag": "ABX9TyM0nkyYN3jK4KOXDtRyOr7j",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/chongzicbo/Dive-into-Deep-Learning-tf.keras/blob/master/10.6.%20%E6%B1%82%E8%BF%91%E4%B9%89%E8%AF%8D%E5%92%8C%E7%B1%BB%E6%AF%94%E8%AF%8D.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ExyEVWzu-y_f",
        "colab_type": "text"
      },
      "source": [
        "##10.6. 求近义词和类比词\n",
        "在“word2vec的实现”一节中，我们在小规模数据集上训练了一个word2vec词嵌入模型，并通过词向量的余弦相似度搜索近义词。实际中，在大规模语料上预训练的词向量常常可以应用到下游自然语言处理任务中。本节将演示如何用这些预训练的词向量来求近义词和类比词。我们还将在后面两节中继续应用预训练的词向量。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "dIivsom8jeD0",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 83
        },
        "outputId": "38f31e28-36d1-400d-9f67-8cef499c6f47"
      },
      "source": [
        "import tensorflow as tf\n",
        "from tensorflow import keras\n",
        "import numpy as np\n",
        "import tarfile\n",
        "print(tf.__version__)"
      ],
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<p style=\"color: red;\">\n",
              "The default version of TensorFlow in Colab will soon switch to TensorFlow 2.x.<br>\n",
              "We recommend you <a href=\"https://www.tensorflow.org/guide/migrate\" target=\"_blank\">upgrade</a> now \n",
              "or ensure your notebook will continue to use TensorFlow 1.x via the <code>%tensorflow_version 1.x</code> magic:\n",
              "<a href=\"https://colab.research.google.com/notebooks/tensorflow_version.ipynb\" target=\"_blank\">more info</a>.</p>\n"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "1.15.0\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "_RKJw97Bj2nC",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "tf.enable_eager_execution()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "n5vOPS8T-5h5",
        "colab_type": "text"
      },
      "source": [
        "###10.6.1. 使用预训练的词向量\n",
        "先下载预训练的中文词向量文件"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "z4TLGysLj5nB",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "w2v_glove=keras.utils.get_file('w2v_glove','https://ai.tencent.com/ailab/nlp/data/Tencent_AILab_ChineseEmbedding.tar.gz') #下载词向量"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "H2Kb77b5kS6a",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 92
        },
        "outputId": "aa5a85c9-d620-40bc-dc45-29b9f7895128"
      },
      "source": [
        "ll /root/.keras/datasets/"
      ],
      "execution_count": 5,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "total 22970976\n",
            "-rw-r--r-- 1 root        1805 Oct 19  2018 README.txt\n",
            "-rw-r--r-- 1 root 16743325366 Oct 19  2018 Tencent_AILab_ChineseEmbedding.txt\n",
            "-rw-r--r-- 1 root  6778940358 Jan 20 06:24 w2v_glove\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "q7FP2BOCn_Rz",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "tar=tarfile.open('/root/.keras/datasets/w2v_glove','r:gz')\n",
        "file_names=tar.getnames()\n",
        "for file_name in file_names:\n",
        "  tar.extract(file_name,'/root/.keras/datasets/')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PS-1Gqq5_Fpp",
        "colab_type": "text"
      },
      "source": [
        "该词向量文件较大，我们仅加载100000个单词的词向量"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "s_lhGnMQo0ma",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "embedding_matrix=np.zeros(shape=[100000,200],dtype='float32')\n",
        "id_to_word={}\n",
        "word_to_id={}\n",
        "f=open(\"/root/.keras/datasets/Tencent_AILab_ChineseEmbedding.txt\")\n",
        "i=0\n",
        "for line in f:\n",
        "  if i==100000:\n",
        "    break\n",
        "  values=line.split()\n",
        "  if len(values)<201:\n",
        "    continue\n",
        "  id_to_word[i]=values[0]\n",
        "  word_to_id[values[0]]=i\n",
        "  #每行的第一个元素是词，后面才是词向量，因此将values[0]与values[1]分开存放\n",
        "  coefs=np.asarray(values[1:],dtype='float32')\n",
        "  embedding_matrix[i]=coefs\n",
        "  i+=1\n",
        "\n",
        "\n",
        "f.close()\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qsApeYGU_VzR",
        "colab_type": "text"
      },
      "source": [
        "打印一个词向量看看，词向量维度为200"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1TAUWXDNuAf_",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 655
        },
        "outputId": "89b5a7dd-1cb8-48f1-9d80-e876b4d9b037"
      },
      "source": [
        "embedding_matrix[-2],list(vocablary.values())[-2]"
      ],
      "execution_count": 62,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(array([ 0.468819, -0.375571,  0.134898,  0.275068, -0.046392,  0.397821,\n",
              "        -0.131702,  0.122381, -0.402547, -0.857406, -0.347803,  0.193127,\n",
              "        -0.381756, -0.322535,  0.320485, -0.141673, -0.415495, -0.111749,\n",
              "         0.004892,  0.244213,  0.109055,  0.054361, -0.077072,  0.432907,\n",
              "         0.077875, -0.012601, -0.22354 ,  0.179577,  0.788958,  0.503813,\n",
              "        -0.292133, -0.103239,  0.351905, -0.091495,  0.195276,  0.311701,\n",
              "         0.196822,  0.003728, -0.268067,  0.036879,  0.160408,  0.129077,\n",
              "         0.722161,  0.296376, -0.039628, -0.612891, -0.318631, -0.653084,\n",
              "        -0.044348, -0.117665, -0.29183 , -0.515927,  0.219945,  0.097836,\n",
              "         0.783008, -0.094529, -0.207038,  0.290682,  0.289661,  0.161448,\n",
              "         0.21637 ,  0.030424,  0.290693,  0.111558, -0.427859, -0.315404,\n",
              "         0.145403,  0.252281,  0.027503, -0.119779, -0.493411, -0.153594,\n",
              "         0.453617,  0.291261,  0.034325, -0.078553,  0.022672, -0.220923,\n",
              "         0.091114, -0.038166,  0.129602, -0.193305,  0.080152,  0.774044,\n",
              "        -0.137743, -0.401648, -0.062281,  0.336899,  0.222145, -0.379443,\n",
              "         0.514574,  0.219336, -0.06313 , -0.11504 ,  0.282494, -0.062637,\n",
              "        -0.068186, -0.252185, -0.309412, -0.4416  ,  0.187851,  0.071766,\n",
              "         0.048593, -0.249546, -0.11645 ,  0.463556, -0.446687, -0.769962,\n",
              "        -0.074112,  0.608788,  0.261987,  0.25335 , -0.28341 ,  0.137621,\n",
              "        -0.046031,  0.745483, -0.176567, -0.208275,  0.387798, -0.41794 ,\n",
              "        -0.261843, -0.279664,  0.069749, -0.338492,  0.607514,  0.261252,\n",
              "        -0.390286,  0.185301,  0.112305,  0.241253,  0.036819,  0.016036,\n",
              "        -0.235715,  0.377717, -0.119861, -0.362685, -0.016579, -0.076662,\n",
              "        -0.320542,  0.210289,  0.061279, -0.392736, -0.558356, -0.536845,\n",
              "         0.321066, -0.239696, -0.384893,  0.488446,  0.172002, -0.30063 ,\n",
              "        -0.092971,  0.590544, -0.031005, -0.067772, -0.481317,  0.548407,\n",
              "         0.361604, -0.048452,  0.095769,  0.023929, -0.086233, -0.379713,\n",
              "        -0.603542,  0.21503 ,  0.465758, -0.081052, -0.057754,  0.31629 ,\n",
              "         0.320508,  0.182844,  0.581269, -0.128856, -0.178356, -0.094118,\n",
              "        -0.09082 , -0.335774, -0.227609,  0.203667,  0.129903,  0.142183,\n",
              "        -0.334078, -0.247655,  0.002846, -0.141992,  0.237982,  0.178088,\n",
              "         0.049336, -0.144875, -0.383344, -0.285463,  0.430627, -0.108355,\n",
              "        -0.056668, -0.389521, -0.156825,  0.371938,  0.081399, -0.223946,\n",
              "        -0.219285,  0.502285], dtype=float32), '办理签证')"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 62
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kxha4KLB_hdS",
        "colab_type": "text"
      },
      "source": [
        "###10.6.2. 应用预训练词向量\n",
        "下面我们以GloVe模型为例，展示预训练词向量的应用。\n",
        "\n",
        "####10.6.2.1. 求近义词\n",
        "这里重新实现“word2vec的实现”一节中介绍过的使用余弦相似度来搜索近义词的算法。为了在求类比词时重用其中的求 k 近邻（ k -nearest neighbors）的逻辑，我们将这部分逻辑单独封装在knn函数中。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "CJJ-Nf7wzFfa",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def knn(W,x,k):\n",
        "  #添加的1e-9是为了数值稳定性\n",
        "  cos=np.dot(W,x.reshape((-1,)))/(np.sqrt(np.sum(W*W,axis=1)+1e-9)*np.sqrt(np.sum(x*x)))\n",
        "  return np.argsort(-cos)[0:k]#降序排序"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "U9vevWti3qdK",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "x=embedding_matrix[1]\n",
        "topk=knn(embedding_matrix,x,k=3)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mTgQY-5P_oJx",
        "colab_type": "text"
      },
      "source": [
        "然后，我们通过预训练词向量来搜索近义词。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "dgJD8z6r35h7",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def get_similar_tokens(query_token,k,embedding_matrix):\n",
        "  id=word_to_id[query_token]\n",
        "  x=embedding_matrix[id]\n",
        "  topk=knn(embedding_matrix,x,k+1)\n",
        "  for i in topk[1:]:\n",
        "    print(id_to_word[i])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pK_cVMPz4dSC",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 73
        },
        "outputId": "9b248af9-6b2f-46a0-efdf-6b341a169f8f"
      },
      "source": [
        "get_similar_tokens('办理',3,embedding_matrix)"
      ],
      "execution_count": 72,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "手续\n",
            "申请办理\n",
            "办理手续\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xmIoVxPg_vA7",
        "colab_type": "text"
      },
      "source": [
        "####10.6.2.2. 求类比词\n",
        "除了求近义词以外，我们还可以使用预训练词向量求词与词之间的类比关系。例如，“man”（男人）: “woman”（女人）:: “son”（儿子） : “daughter”（女儿）是一个类比例子：“man”之于“woman”相当于“son”之于“daughter”。求类比词问题可以定义为：对于类比关系中的4个词$a : b :: c : d$，给定前3个词 $a 、 b$ 和 $c$ ，求 $d$ 。设词 $w$ 的词向量为$ vec(w)$ 。求类比词的思路是，搜索与$\\text{vec}(c)+\\text{vec}(b)-\\text{vec}(a)$的结果向量最相似的词向量。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "p3qLKfyK7asr",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def get_analogy(token_a,token_b,token_c,embedding_matrix):\n",
        "  vec_a=embedding_matrix[word_to_id[token_a]]\n",
        "  vec_b=embedding_matrix[word_to_id[token_b]]\n",
        "  vec_c=embedding_matrix[word_to_id[token_c]]\n",
        "  x=vec_c+vec_b-vec_a\n",
        "  topk=knn(embedding_matrix,x,2)\n",
        "  for i in topk[0:]:\n",
        "    print(id_to_word[i])\n",
        "  # return id_to_word[topk[0]]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mlZodIrKAF5U",
        "colab_type": "text"
      },
      "source": [
        "验证一下“男-女”类比。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "tUa1HZi49H8p",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 55
        },
        "outputId": "f0022f7b-7a2b-4e27-f92d-4bb676bacc11"
      },
      "source": [
        "get_analogy('男人','女人','儿子',embedding_matrix)"
      ],
      "execution_count": 89,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "儿子\n",
            "女儿\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FiwtR4cyAOPi",
        "colab_type": "text"
      },
      "source": [
        "“首都-国家”类比"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "uU-2aBKB9fRS",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 55
        },
        "outputId": "904ec9b4-df87-4097-b487-11a990f1b776"
      },
      "source": [
        "get_analogy('中国','北京','日本',embedding_matrix)"
      ],
      "execution_count": 94,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "北京\n",
            "东京\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HxfPglDo-c4i",
        "colab_type": "text"
      },
      "source": [
        "###10.6.3. 小结\n",
        "* 在大规模语料上预训练的词向量常常可以应用于下游自然语言处理任务中。\n",
        "* 可以应用预训练的词向量求近义词和类比词。"
      ]
    }
  ]
}